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Abstract 

Background: Mycobacterium abscessus is a rapidly growing mycobacterium tliat is often associated witli liuman 
infections. Tine taxonomy of tliis species lias undergone several revisions and is still being debated. In this study, we 
sequenced the genomes of 12M abscessus strains and used phylogenomic analysis to perform subspecies 
classification. 

Results: A data mining approach was used to rank and select informative genes based on the relative entropy 
metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using 
the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support 
for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, 
single nucleotide polymorphisms (SNPs)-based classification and strain clustering demonstrated by a variable 
number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that 
the concatenation of a minimal set of three median-ranked genes: DNA polymerase III subunit alpha (po/C), 
4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (faZ), is sufficient to recover the same 
tree topology. PGR assays designed specifically for these genes showed that all three genes could be amplified in 
the reference strain of M abscessus ATCC 19977^ 

Conclusion: This study provides proof of concept that whole-genome sequence-based data mining approach can 
provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery 
of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The 
systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be 
applied to species or subspecies classification of other bacteria. 



Background 

The rapidly growing mycobacteria (RGM), defined as 
mycobacteria that grow in culture media within seven 
days, are mostly environmental organisms, some of 
which have emerged as important opportunistic patho- 
gens in humans, colonizers in the airway [1] or contami- 
nants on surgical instruments [2]. More than a hundred 
species have so far been identified [3]. As different spe- 
cies and their subgroups are known to differ in their 
pathogenicity and susceptibility to antibiotics [4,5], their 
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rapid and accurate identification is important for making 
therapeutic decisions in patient management. Unfortu- 
nately, although many taxonomic advances have been 
made in recent years, resulting in the creation of new 
species, subspecies and subgroups, the taxonomic status 
of some members is still unresolved. 

Mycobacterium abscessus is generally regarded as the 
most important RGM associated with human infections 
that range from localized cutaneous inflammation to 
rare but serious disseminated sepsis [6,7]. It is respon- 
sible for more than 80% of the chronic lung diseases 
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caused by RGM, some of which require surgical resec- 
tion of the infected lung for complete resolution of 
symptoms [8]. This organism shows many similarities 
with another RGM, M. chelonae, and was once classified 
as M, chelonae subspecies abscessus [9]. With DNA-DNA 
hybridization studies, however, it became recognized as a 
distinct species [10,11]. Subsequently, the species was 
divided into three subspecies: M, abscessus sensu 
stricto, M, massiliense and M, bolletii [12-14] but the 
most recent classification describes just two subspecies 
which are M, abscessus subspecies abscessus (formerly 
M, abscessus sensu stricto) and M, abscessus subspe- 
cies bolletii that includes the previous two subspecies 
of M, bolletii and M, massiliense [15]. However, in 
many phylogenetic and phylogenomic studies, M, 
abscessus is clearly separated into three subgroups 
[16-18]. CUnical and population studies have also indi- 
cated notable differences among the three subspecies 
[16,17,19]. Hence, in this paper, we use the former 
three subspecies classification for ease of reference to 
earlier publications. 

Traditionally, species identification for RGM is based 
on biological and biochemical tests such as pigment pro- 
duction, 3-day arylsulfatase reaction, nitrate reduction, 
iron uptake, and tolerance to 5% NaCl. Antibiotic sus- 
ceptibility has also been used to assist in species or sub- 
species classification. For instance, polymyxin B inhibits 
M, fortuitum but not M, abscessus [20], while in the M, 
abscessus group, M, massiliense can be differentiated 
from the other two subspecies by a truncated erythro- 
mycin ribosome methyltransferase {erm 41) gene associ- 
ated with susceptibility to macrolide antibiotics [19]. 
Although easy to use, the accuracy of these tests can be 
easily affected by the phenomenon of horizontal gene 
transfer (HGT) and differential gene expression. There- 
fore, phenotypic tests have largely been replaced by 
modern genotypic methods. 

The PCR-Restriction Enzyme analysis (PRA) is one of 
the most commonly used genotypic methods. Direct se- 
quencing of PGR amplified products based on the poly- 
morphism of housekeeping genes has also become 
available in many diagnostic laboratories. In both 
methods, the hsp65 gene, which is highly conserved 
within species, is frequently used. Most of the RGM 
currently known can be identified from RE patterns 
generated from a 439 bp variable portion of the gene 
[21,22]. In PCR-sequencing, the hsp65 gene shows 
more variability among RGM species than the 16S 
rRNA gene, the universally conserved gene chosen for 
the phylogenetic analysis of prokaryotes. Thus, it is 
better than the latter for the identification of closely- 
related mycobacterial species [23]. Other housekeeping 
genes that have been used in RGM studies, albeit at 
lower frequency, include the internal transcribed spacer 



(ITS) of the 16S-23S rRNA gene [24], the rpoB [25], 
sodA [26], secA [16]. recA [27] and gyrB [28] genes. 
Commercially available reverse line probe hybridization 
assays (INNO-LIPA Mycobacteria, Innogenetics, Ghent, 
Belgium; GenoType Mycobacterium CM/AS, Hain Life- 
science Gmbh, Germany) are increasingly being used as 
convenient molecular tools for rapid subspecies identifica- 
tion in diagnostic laboratories. 

The conventional phylogenetic approach uses single or 
limited numbers of genes to infer phylogenetic relation- 
ships among the taxa of interest. Classification of taxa 
can then be done on the basis of the topology of the in- 
ferred phylogenetic tree. A potential drawback of this 
method is that the optimality of phylogenetic signal in 
the genes used, as quantified using some suitable metric, 
is largely unknown because there is no systematic pro- 
cedure to find suitable candidates from a population of 
genes. The recent introduction of next-generation se- 
quencing has made it possible to use whole- genome se- 
quence information for subspecies classification via 
phylogenomic clustering. An example is the phyloge- 
nomic reconstruction of lactic acid bacteria by Zhang 
et al, [29]. The purpose of this study is to test the phylo- 
genetic informativeness of five classical housekeeping 
genes: rpoB, hsp65, secA, recA and sodA that are com- 
monly used in M, abscessus subspecies classification. 
Concurrently we wish to propose a phylogenomic ap- 
proach to identify, to the subspecies level, 12 M absces- 
sus strains isolated from clinical samples, using a smaller 
set of genes that is at least as informative as the five 
classical genes for the purpose of inferring a high- 
confidence tree topology. 

Results 

Genome assembly, annotation and identification of 
orthologs 

The draft genomes of our 12 M, abscessus strains, as- 
sembled using CLC Genomics Workbench and anno- 
tated using RAST, showed genome sizes ranging from 
4,802,413 bp to 5,488,620 bp with an average of 
5,054,711 bp. The number of putative Coding Sequences 
(CDS) ranged from 4,709 to 5560 (Additional file 1: 
Table SI). With the use of BLASTClust on the putative 
CDS from all genomes including those of the three ref- 
erence strains and the three outgroup species, 210 
orthologous genes were identified. 

Single and multiple gene-based subspecies classification 

On comparing the phylogenetic trees constructed 
using the five common marker genes for M. abscessus 
subspecies classification, the single gene- based trees 
inferred using rpoB, hsp65 and secA genes showed 
clear and consistent classification of each isolate into 
subspecies (Figure lA-C). The recA gene, however. 
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Figure 1 Phylogenetic inference using classical genes. The trees were constructed using the maximum lil<elihood method, with 1000 
bootstrap replicates. Only bootstrap values at important branches are shown. All the trees were inferred based on Tamura 3-parameter model 
with complete deletion option but at different substitution rates. (A) /7sp65-based tree (Uniform rate), (B) rpofi-based tree (Gamma rate), 
(C) secA-based tree (Gamma rate), (D) recA-based tree (Gamma rate), (E) sodA-based tree (Uniform rate), (F) supermatrix tree (Gamma rate). 



gave a slightly different classification (Figure ID). For in- 
stance, the two reference strains, M. bolletii BD^ and M. 
massiliense CCUG 48898^ were grouped together and 
strain M24 was grouped with M. abscessus ATCC 19977^, 
instead of M. bolletii BD^ as in the rpoB-, hsp6S- and 
5ecA-derived trees. The sodA -based tree was also different 
in that strain Ml 15 was identified as M, abscessus sensu 
stricto instead of M, massiliense as observed in the other 
trees (Figure IE). These results indicate that sodA and 
recA marker genes, which are commonly used for myco- 
bacterial classification, may not be suitable for M, absces- 
sus subspecies identification. 

The supermatrix tree obtained from the concatenation 
of the five genes has also revealed three clear subspecies 
grouping supported by strong bootstrap values of 93% 
for M, massiliense, 99% for M. bolletii and 100% for M. 
abscessus sensu stricto (Figure IF). The classification for 
our 12 strains was similar to those obtained with the 
rpoB, hsp6S and secA genes. 

Phylogenomic approach to subspecies classification 

The availability of whole-genome sequences enabled us to 
systematically evaluate the phylogenetic informativeness 



of a large pool of candidate genes and then select opti- 
mal ones for phylogenetic tree construction. For this 
evaluation, we used the same procedure to predict the 
Open Reading Frames (ORFs) in the 12 genomes se- 
quenced in this study, as well as the genomes of the 
reference and outgroup species. On aligning the se- 
quences of the 210 orthologous genes identified with 
BLASTClust, we found the MSA produced using MAFFT 
[30] to be more robust than that produced using 
MUSCLE [31]. Subsequently, we chose the MAFFT- 
aligned MSA for downstream analysis. The phylogenetic 
informativeness of each gene was assessed and ranked 
based on calculated entropy values as described in the 
Methods section. By using this systematic approach, we 
selected 50 median-ranked genes which would be the 
most phylogenetically optimal for the purpose of subspe- 
cies phylogeny inference. The functional analysis of these 
50 median-ranked genes, as well as the 50 top-ranked and 
50 bottom ranked genes showed that the majority of the 
bottom-ranked genes are involved in translational pro- 
cesses, whereas those of the top-ranked genes are mostly 
involved in enzyme metabolism (Figure 2; Additional file 1: 
Table S2-S4). These results indicate that the ranking 
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Figure 2 Functional analysis of orthologs. (A) COG classification composition of genes in tine top, median and bottom-ranl<ed genes. (B) COG 
classification of genes in the top, median and bottom-ranked genes. The highest relative entropy is found in the left-most member of the 
top-ranked genes; the lowest entropy is found in the right-most member of the bottom-ranked genes. The set of minimal genes capable of 
reconstructing the clustering topology as the one obtained using 50 median-ranked genes is boxed. 



obtained using the relative entropy metric appears to be 
biologically reasonable. 

In the phylogenomic tree constructed based on the se- 
lected 50 median-ranked genes (Figure 3A), the classification 



of isolates inferred agreed with the one inferred using the 
five concatenated classical housekeeping genes. Strains M93 
to M94 were clustered with M. abscessus sensu stricto; M24 
with M bolletii and the others with M massiliense. 
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(B) Suptrmatrix tree of three genes 




Figure 3 Supermatrix phylogenomic trees (General Time 
Reversible model with Gamma substitution rate, include all 
codon positions, complete deletion option). (A) Fifty-gene tree. 
(B) Three-gene tree. Phylogenetic trees generated from the 
concatenation of the selected genes, constructed with the maximum 
lil<elihood method with boostrap values at important branches (1,000 
bootstrap replicates). As shown in (B), the concatenation of three 
genes is sufficient to construct a phylogenetic tree showing subspecies 
classification similar to that in the supermatrix tree of fifty genes. As 
anticipated, of the 12M. abscessus isolates analysed, only M24 is 
classified as M. bolletii, supported with 100% bootstrap support. 
M93, M94 and Ml 52 are classified as M. obscessus sensu stricto 
and others as M. massiliense. 



To find a minimal set of genes capable of producing 
the same subspecies classification as that obtained using 
the 50 median-ranked genes, we iteratively removed 
genes from the left and right ends of the list of median- 
ranked genes, and used the remaining genes to construct 
the phylogenetic tree. The process was terminated 



when the classification implied by the tree topology 
changed. Thus, we found that the minimum number of 
median-ranked genes needed for recovering similar 
subspecies classification (Fig ure 3B) attained using the 
50 median-ranked genes was three. These three genes 
were DNA polymerase III subunit alpha {polC; Acces- 
sion: YP_001703430.1), 4-hydroxy-2-ketovalerate aldolase 
{Hoa; Accession: YP_001701378.1) and cell division pro- 
tein FtsZ (ftsZ; Accession: HQ662067.1). We confirmed 
the amplifiability of informative fragments from these 
three genes in M. abscessus ATCC 19977^ (Additional file 
1: Figure SI; Additional file 1: Table S5). 

Evidence to support the subspecies classification by 
phylogenomics analysis 

Examination of erm41 and 16S-23S rRNA ITS genes 

Previous studies have shown that the erm41 and the 
16S-23S rRNA ITS genes could be used to differentiate 
the subspecies of M abscessus [12,19]. We examined the 
sequences of these genes in all 12 strains and found 
seven strains (M156, M148, M154, M172, M152, Ml 15 
and Ml 8) having deletions at the 64^^ and 65^^ positions 
and a large deletion from position 159^^ to 432^*^ in the 
erm41 gene (Figure 4A), as well as an A to G substitu- 
tion at position 60* and insertion of C at position 102"^*^ 
in the ITS gene (Figure 4B). Since these features are typ- 
ically found in M. massiliense, their presence in the 
seven strains further supported the classification of these 
strains as M. massiliense, as observed in the supermatrix 
analyses. However, Ml 39 which was classified as M, 
massiliense in all the single gene-based and multiple 
gene-based approaches did not have these expected 
features. 

Classification of other /W. abscessus strains from public 
database 

At the end of our study, many M, abscessus genome se- 
quences have been deposited in NCBI Genbank data- 
base. To further test the reliability of the three selected 
gene markers, we used them to classif)^ 43 M, abscessus 
genomes (12 from our laboratory; 28 deposited by other 
researchers in the NCBI Genbank database; three refer- 
ence) and three outgroup genomes (Additional file 1: 
Table S6). The phylogenetic tree constructed using the 
minimal set of three genes clearly showed three major 
groups corresponding to the three M, abscessus sub- 
species (Figure 5). For example, the strains from 
6G0125S to 4S0726RB were grouped into the M, absces- 
sus sensu stricto cluster, whereas the strains from 47J26 to 
5S0422 were likely to be M, massiliense. None of the add- 
itional 28 strains were identified as M, bolletii. This classi- 
fication was in accordance with and supported by the 
analysis of hsp6S and rpoB gene sequences (Additional file 
1: Figure S2). The addition of 28 strains into this study 
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Figure 4 Multiple alignments of erm41 and 16S-23S rRNA ITS gene sequences. (A) Multiple alignment of erm41 sequences revealed 
deletions at positions 64^^ and 65^^ in addition to a large deletion (boxed in blue) in all M. massiliense strains, except Ml 39. (B) Multiple 



alignment of 16S-23S rRNA ITS gene sequences showed A to G substitution at position ( 
in all M. massiliense strains except for Ml 39. 



' and insertion of C at position 102" (boxed in blue) 
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Figure 5 Phylogenetic tree based on three marker genes, for 43 M. abscessus strains and tiiree outgroups (Tamura 3-parameter 
model withi Gamma substitution rate, include all codon positions, complete deletion option). Of the 28 additional M. abscessus 
strains, tliis pliylogenomics analysis revealed 19 strains as M. abscessus sensu stricto and 9 strains as M. massiliense, supported by 97% to 
98% bootstrap support. 



did not change the classification of the 12 strains from our 
initial analysis. These results again support our belief that 
the three selected genes are reliable markers for the classi- 
fication of M. abscessus subspecies. 

Other supporting evidence 

Recently, Wong et al [32] showed that a variable- 
number tandem-repeat (VNTR) typing assay for the 
M, abscessus species seemed to give good differenti- 
ation of M, abscessus strains. In their study, both the 
VNTR assay and a multilocus sequence analysis assay 
(MLSA) based on seven housekeeping genes {argH, 
cya, glpKy gndy murC, pta, and purH) yielded subspecies 



clustering that is concordant (with the exception of Ml 39) 
with the one inferred using the present phylogenomics 
approach. 

In addition, the availability of whole genome sequences 
of the 12 M. abscessus strains allowed us to proceed with 
the reconstruction of a phylogenetic tree using single 
nucleotide polymorphisms (SNPs) in the genomic re- 
gions that are conserved across all strains (Additional 
file 1: Figure S3). The resulting SNP-based tree showed 
that the classification of the 12 M abscessus strains is 
similar to the classification obtained from the phyloge- 
nomics tree generated using our systemic data mining 
approach. 
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Figure 6 Classification of strains according to different methods. Color legend: Yellow for M. abscessus sensu stricto; red for M. massiliense; 
green for M. bolletii; blue for M. obscessus sensu stricto//W. boletti; orange for M. obscessus sensu stricto/M massiliense. Abbreviation: SM: 
supermatrix; PG: phylogenomics; SNP: single-nucleotide polymorphism. 



The classification results of the 12 sequenced genomes 
using different approaches are summarized in Figure 6. 

Discussion 

Using a proposed phylogenomic approach, we have suc- 
cessfully classified into subspecies, 12 sequenced ge- 
nomes of M. abscessus strains isolated from clinical 
samples. The primary value of this approach is the use 
of a set of objectively selected genes for constructing a 
high confidence tree topology that can serve as a bench- 
mark. Subsequently, the classification implicit in the top- 
ology of the inferred tree can be used to guide the 
selection of a minimal gene set for practical, routine use. 
Existing methods (multiple-genes, MLSA) are not able 
to produce such benchmarks, and their resulting classifi- 
cation accuracy may be largely a matter of luck rather 
than science. Corroboration from the analysis of whole- 
genome SNP data further convinces us of the usefulness 
of the phylogenomic approach to systematically discover 
a minimal gene set for subspecies classification, at least 
in M. abscessus. 

Our phylogenomic approach uncovered a set of genes 
detected using median-ranking that supports the sub- 
species classification inferred using the concatenation of 
the five classical genes. The set of median-ranked genes 
has good coverage of all important functional classes 
compared to the top and bottom-ranked genes, so using 



their concatenation for phylogenetic inference poten- 
tially avoids systematic biases that can be caused by hav- 
ing a concentration of genes in the ranked list for some 
particular functional class. Further refinement resulted 
in a new, three-member gene set that preserves the simi- 
lar classification. The three selected genes are biologic- 
ally interesting and come from three different functional 
classes. The KEGG database [33] shows that DNA poly- 
merase III subunit alpha (polC) functions in DNA repli- 
cation; 4-hydroxy-2-ketovalerate aldolase is involved in 
phenylalanine metabolism, degradation of benzoate, di- 
oxin and xylene; and cell division protein gene ftsZ pro- 
duces the FtsZ GTP-binding cell division protein which 
is important for the cytoskeleton formation during cell 
wall synthesis. A brief survey of available literature 
shows that polC has been used as a phylogenetic marker 
in Bacillus subtilis subspecies classification [34] as well 
as lactic acid bacteria phylogeny [29]. In Wolbachia 
phylogenetic studies, ftsZ has been used in conjunction 
with other genes [35,36]. 

In this study, we also assessed a set of genes com- 
monly used by other workers for the subspecies classifi- 
cation of M, abscessus {rpoB, hsp6Sy recA, secA, sodA, 
erm41 and ITS) and found inconsistent classification 
for certain strains with recA, sodA, erm41 and ITS 
genes. It is well-known that the single gene-based ap- 
proach is sensitive to the evolutionary history of the 
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gene and not necessarily of the species, and may be 
limited by the lack of sufficient variation in a single 
gene sequence [37-39]. A multiple gene-based approach, 
if the genes are carefully chosen, may amplif)^ the individ- 
ual phylogenetic signals present in the genes and lead to 
an accurate inference of phylogenetic tree for subspecies 
classification [40,41]. This is seen in the superior robust- 
ness of our supermatrix trees whether they are based on 
the concatenation of five classical genes, core genome 
SNPs or our three gene markers. 

Of the 12 strains we isolated. Ml 39 had the most am- 
biguous taxonomic identity. It was classified as M. mas- 
siliense in the current study and also in two recent 
publications [17,18] using our M, abscessus genome se- 
quences for comparative analyses. However, it had fea- 
tures of M, abscessus sensu stricto in the erm41 and ITS 
genes and was shown to cluster with M, abscessus in 
the MLS and VNTR assays reported by Wong et al, 
[32]. Additionally, it was identified as M, abscessus 
subsp. abscessus in the MALDI Biotyper system (Bruker, 
Germany) that uses the MALDI-TOF (matrix-assisted 
laser desorption ionization-time of flight) mass spectrom- 
etry to identify organisms by their molecular (protein) fin- 
gerprints. The erm41 gene in M139 is intact but the 
minimum inhibitory concentration (MIC) of clarithromy- 
cin obtained with the Epsilometer test (bio Merieux, 
France) was 0.094 mg/1 even after prolonged incubation 
up to 14 days. This high level of in vitro susceptibility is 
more often seen among M, massiliense than M, abscessus 
or M. bolletii strains. In Ml 39, however, this low MIC can 
be attributed to the C28 polymorphism in the erm41 gene 
which has been reported to prevent the expression of the 
inducible clarithromycin resistance conferred by an intact 
erm41 gene [42]. 

The apparent incongruities in the subspecies classifi- 
cation of M139 could be because previous studies used 
insufficient numbers of strains to describe "typical" 
features associated with the different M, abscessus sub- 
species. More extensive sampling of M, abscessus pop- 
ulations from different geographical locations and clinical 
settings would help to resolve current taxonomic 
uncertainties. 

Conclusion 

Whole-genome sequencing of medically important bac- 
terial strains opens up new possibility for accurate sub- 
species classification via a phylogenomic approach. We 
have shown how a large pool of orthologs can be iden- 
tified bioinformatically from the genomes of 12M. 
abscessus isolates, assembled from fragments generated 
using Illumina shotgun sequencing. We then intro- 
duced a method to rank the phylogenetic informative- 
ness of the identified orthologs using the relative 
entropy metric. Based on the set of 50 median-ranked 



genes, a benchmark phylogeny was obtained, and we 
used its topology to infer subspecies classification. We 
made the method practical by discovering a three- 
member minimal gene set that could return the same 
subspecies classification obtained using 50 median- 
ranked genes. The workflow here provides an objective 
means for the development of molecular-based classifi- 
cation method that was not possible with previous 
methods such as single gene, arbitrary multiple gene 
approaches, VNTR as well as MLSA. Our three gene 
set is at least as informative as classical markers com- 
monly used by other researchers. Its applicability can 
be further evaluated by testing it on a larger popula- 
tion of strains, from different clinical settings and geo- 
graphical locations. 

Methods 

Bacterial strains 

The 12 M. abscessus strains used in this study were 
collected between July 2009 and June 2011 from the 
Clinical Microbiology Laboratory of University of Malaya 
Medical Centre (UMMC), Kuala Lumpur, Malaysia. 
With the exception of one strain from a lymph node, all 
other strains were from sputum or bronchoalveolar 
fluid of patients presenting with respiratory infections. 
In the UMMC laboratory, clinical specimens from pa- 
tients suspected to have tuberculosis are routinely cul- 
tured in the BACTEC MGIT960 liquid culture system 
(Becton-Dickinson). Positive cultures are examined micro- 
scopically and subcultured on Lowenstein-Jensen (LJ) 
slants. Acid-fast colonies on LJ slants are then identified 
using a variety of tests. The strains used in this study 
were identified as M, abscessus by their rapid growth 
(within seven days of incubation) in the MGIT960 sys- 
tem, non-pigmented acid-fast colonies on LJ slants, 
typical restriction pattern in a PCR-restriction frag- 
ment length polymorphism analysis [21] and a positive 
identification in a reverse line probe hybridization 
assay (GenoType Mycobacterium CM/AS; Hain Life- 
science GmbH, Germany). 

All isolates were kept in Middlebrook 7H9 broth 
with 15% glycerol, at -80°C, until required for further 
testing. 

For whole-genome sequencing, archived strains were re- 
trieved, rendered non-nominal and subcultured on LJ 
slants. This study involved only genomic analysis of rou- 
tine isolates and, with the exception of specimen type and 
geographical origin, none of the data used could expose 
patient identity. As such, we considered it unnecessary to 
apply for ethical approval by the University's Medical Eth- 
ics Committee Standard Operating Procedures (http:// 
www.ummc.edu.my /view/content.php?ID=56). 

The genome information of the reference strains was 
extracted from the NCBI Genbank [43] database. The 



Tan et al. BMC Genomics 2013, 14:879 
http://www.bionnedcentral.conn/1 471 -21 64/1 4/879 



Page 10 of 13 



reference strains used for comparison are given in 
Additional file 1: Table S6. 

DNA extraction & Isolate validation 

Pure cultures on LJ slants were harvested by flooding each 
slant with 3 ml of phosphate buffered saline (PBS, Ph 7.4) 
followed by gentle scraping of the agar surface with a glass 
Pasteur pipette. The suspension was then collected into a 
15 mL centrifuge tube and vortex mixed for 5 minutes. For 
DNA extraction, 200 [xL of suspension was processed 
with the ZR Fungal/Bacterial DNA MiniPrepTM (Zymo 
Research, USA) according to the protocol provided by the 
manufacturer. DNA concentration and purity were mea- 
sured on the NanoDrop2000 spectrophotometer (Thermo 
Scientific). To validate the identity of the 12 M. abscessus 
isolates, we used PGR to amplify their hsp6S genes for 
Sanger sequencing, and then used BLAST to query the 
sequences to known Mycobacterium species/subspecies 
gene sequences stored in a web-accessible database of 
hsp6S locus sequences [44]. All 12 strains achieved almost 
perfect or perfect (98% to 100%) similarity to M, abscessus. 

Whole-genome sequencing, annotation and Identification 
of orthologs 

The genomes of the 12 isolates were shotgun-sequenced 
using Illumina Genome Analyzer 2X technology. The raw 
sequencing reads were trimmed at a threshold of 0.01 and 
the sequences obtained were assembled de novo using GLG 
Genomics Workbench version 4.9 (GLG bio, Denmark). All 
target genes for analysis were extracted from the assembled 
genomes of the 12 M. abscessus strains. The gene 
sequences for the reference M, abscesuss subspecies were 
obtained from the NGBI database. We used BLAST [45] to 
locate the boundaries of the genes in the 12 assembled draft 
genomes. 

We annotated the genomes of strains included in this 
analysis using Rapid Annotation running on Subsystem 
Technology (RAST) automated pipeline [46]. The ortho- 
logs among the putative open reading frames (ORFs) were 
identified using BLASTGlust in a standalone BLAST 
program [47]. 

Phylogenetic classification using classical markers 

The pan-bacterial 16S rRNA gene, which is generally con- 
sidered the best target for bacterial identification and phylo- 
genetic analysis, was not used in this study as it has been 
reported to lack discriminatory power for M, abscessus 
subspecies [48]. Instead, five other genes that have been 
used by other workers were chosen for phylogenetic 
analysis, namely, the rpoB (711 bp) gene encoding for the 
y5-subunit of RNA polymerase enzyme responsible for RNA 
synthesis [49]; the hsp6S (603 bp) gene for the 65 kDa heat 
shock protein that is involved in intracellular protein fold- 
ing, assembly and transport [50]; the sodA (539 bp) gene 



encoding superoxide dismutase, a metalloenzyme that con- 
tributes to defence against oxidative stress in mycobacteria 
[51]; the recA (1041 bp) recombinase gene related to DNA 
repair [52] and the secA (700 bp) gene that is involved with 
pre-protein transport in the bacterial cell [53]. 

Each of the five housekeeping gene sequences was 
aligned using GlustalX from the European Bioinformatics 
Institute (EBI) online server to study similarities and vari- 
able sites among the subspecies [54]. We then used MEGA 
version 5.10 [55] to select the best DNA substitution model 
for each multiple sequence alignment (MSA), followed by 
maximum likelihood estimation of trees. In addition, we 
concatenated all five genes, and similarly used the MSA 
of the latter to infer maximum likelihood trees (1,000 
bootstrap replications). 

Two additional genes, erm41 and 16S-23S rRNA ITS 
with known features for subspecies identification were used 
to fiarther support the classification results. M, massiliense 
can be recognized by the presence of a truncated erm41 
gene [19] as well as an A to G substitution at the 60th 
position and a G insertion at the 102nd position of the 
16S-23S rRNA /re [12]. 

Phylogenomic classification using whole-genome data 

Not all genes have the same phylogenetic signal [56]. The 
main advantage of the phylogenomic approach is that it 
allows the generation of a large population of orthologs in 
the taxa of interest so that an objective assessment of their 
phylogenetic informativeness can be made. These genes 
can be ranked accordingly based on some metric that is 
entropy-dependent for the purpose of quantifying phylo- 
genetic informativeness. Given a list of genes ranked from 
high to low entropy, we propose that a subset of genes in 
the neighbourhood of the median of this ranked list would 
be the most phylogenetically optimal for the purpose of 
subspecies phylogeny inference. If this method works, a 
simple fiinctional categorization of genes in the top, median 
and bottom-ranked genes should roughly show the follow- 
ing pattern: dominance of genes involved in the highly 
conserved translational process in the bottom-ranked 
genes, and dominance of genes involved in metabolism in 
the top-ranked genes. The subsequent phylogenetic tree 
constructed using the median-ranked genes would then 
constitute a high confidence phylogeny. 

To prepare the genes for ranking, we performed multiple 
sequence alignment (MSA) for the orthologs identified. 
Next, we removed all MSA with alignment gaps greater 
than 5% of the size of the alignment block. To reduce the 
chances of using genes that are not robust to choice of the 
MSA algorithm, we compared the results of MUSGLE and 
MAFFT. We chose these two algorithms because of posi- 
tive evaluations with respect to their accuracy and speed 
performance [57]. To exclude co-optimality of MSA as a 
potential source of alignment error, we further performed 
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the Heads or Tails (HoT) analysis [58] on the identified 
orthologs. In HoT analysis, an MSA using reversed se- 
quences was first built; this MSA was then reversed and 
then compared against the MSA built using the original, 
unreversed sequences. Only genes that show identical MSA 
after an HoT analysis were subsequently used for down- 
stream analysis. 

To quantify the phylogenetic signal of the ith gene fe) 
gene, we used the relative entropy metric for a gene s MSA: 

where s is the number of sequences, L is the length of the 
MSA, Pxy(nt) is the relative frequency of state nt (gaps are 
treated as a fifth character) in the entire MSA at row x and 
column J, with nt e {A, G, C, T,-}, Pxi^^i) is the row marginal 
frequency for state nt, and Py(nt) is the column marginal 
frequency for state nt This metric measures the divergence 
between the nucleotide frequencies in the entire MSA from 
the nucleotide frequencies in the rows and columns of the 
MSA. An MSA with noiseless columns would yield a zero 
value since the nucleotide frequencies in the columns 
would be degenerate, and those in the rows would be the 
same as the nucleotide frequencies in the MSA. For MSA 
with noisy columns, H increases. All computation was done 
using an in-house R script [59] (available upon request). 

Fifty genes - 25 to the left and right (in alternating order) 
of the list of median-ranked genes, were concatenated and 
used to reconstruct the phylogenetic tree of the 12 isolate 
strains. Taking the clustering of strains based on the 50 
median-ranked genes as the gold standard, we subsequently 
deduced the set of minimum median-ranked genes needed 
to recover the same clustering by iteratively reducing the 
number of median-ranked genes used and checking the 
resulting change in clustering topology. Candidate markers 
in this minimum set were then tested for their amplifiability 
using a standard PGR protocol to establish their practical 
usefulness as new, objectively inferred phylogenetic markers. 
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